0. Introduction - About Dataset

All scripts scraped from an online repository: http://www.chakoteya.net/DoctorWho/index.html

Ratings and runtimes retrieved from IMDB.

Information to match serialized parts with episodes, as well as information regarding writers and UK viewership numbers, taken from: https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963-1989)

Python code used for web crawling and initial data construction can be found at: https://github.com/LaurenceDyer/DocWho

Spanning roughly 60 years, Doctor Who is a collection of episodic science fiction radio plays and television serials starring the eponymous “Doctor”, a humanoid alien from the planet Gallifrey. The Doctor explores the Universe, though mainly sticking close to Earth, with his long-term companions. Upon his death, the Doctor regenerates, and a new actor takes the role. As such, the Doctor has been portrayed by no fewer than 13 actors over the show’s history.

A 15-year intermission in production occurred between 1990 and 2005, leading many fans to consider the series split into classic (Doctors 1 through 7) and modern (Doctors 9 through 13) runs. Doctor 8 was portrayed only in a made-for-TV movie and is therefore missing from this analysis.

The data set is, accordingly, very large, with roughly 250,000 lines of dialogue over roughly 330 episodes.

1. Data Cleaning and Exploratory Analysis

Before we get started, we must remedy the issue that IMDB records classic episodes according to their individual parts/viewing slots, rather than as whole episodes. We’ll combine these parts into single episodes, matching our script source and maintaining complete episode narratives for some of our downstream analysis, e.g. episode sentiment:

We can do this by counting how many parts each episode is assigned on Wikipedia, then using that count to average out viewership and rating and sum up runtime, giving a single rating and runtime for each episode:
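A minimal sketch of that aggregation with dplyr, assuming a `parts` data frame with one row per broadcast part and hypothetical columns `Story`, `Rating`, `Viewers` and `Runtime`:

```r
library(dplyr)

# Collapse the per-part IMDB records into one row per serial.
# `Story` links each part to its serial, via the Wikipedia episode list.
episodes <- parts %>%
  group_by(Story) %>%
  summarise(Parts   = n(),
            Rating  = mean(Rating),    # average the per-part ratings
            Viewers = mean(Viewers),   # average out viewership
            Runtime = sum(Runtime))    # sum part runtimes into one episode
```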

The scripts that we are using are designed to be human readable, but do not necessarily lend themselves well to mass data analysis. They were also transcribed manually, and as such, many typos are present. The process of correcting script-breaking typos (such as typo’d dialogue syntax) has already been performed; however, it is likely that among the 240,000 dialogue lines, an unknown number of artefacts remain.

The first major pass of artefact removal in our script lines has already been performed, largely in Python. This process also involved removing several kinds of strings that defined stage direction cues or provided visual descriptions of events.

Let’s start with some very general data overviews to see if we can locate any major remaining artefacts.

Episodes and Writers - Overview

Let’s take a quick look at the data we’ve crawled from Wikipedia and IMDB, and explore how runtime, rating and viewership have changed over time.

When it comes to episode runtime, we see two clear trends: the serialised classic-era episodes vary far more in length than the more restrained modern era, and modern-era episodes are growing longer as the newer seasons stretch on.

When it comes to rating, we see that “Orphan 55”, an episode about climate action, is undoubtedly the least popular Doctor Who episode going. In fact, all 5 of the lowest-rated episodes come from the latest 3 seasons of the show, starring the 13th Doctor. Viewership numbers have dropped accordingly, dipping below the previous all-time low held by “Battlefield”, among other episodes from the final classic season.

“Utopia”, the first episode of the multi-part season 29 finale, holds the title of highest-rated episode; it features the return of the long-time series villain “The Master”. “City of Death”, meanwhile, holds the all-time viewership record: it aired at prime time in the middle of a workers’ strike that took ITV, the BBC’s main source of competition, off air for several weeks.

Let’s see which writers were the most popular over the series’ run:

Steven Moffat and Russell T. Davies prove to be among the most popular writers by episode rating, while also being the longest-serving writers by episode count, at 48 and 31 episode credits respectively.

Character and Location - Overview

To get a sense of how deep some of our script-input problems run, we’ll need to examine the data and get an overview of the potential data cleaning we have to perform. Let’s start by examining our two most rigid columns, “Character” and “Location”.

Character    Frequency   Location         Frequency
DOCTOR           58665   [Tardis]             13713
CLARA             3635   [Control room]        4507
JAMIE             3374   [Corridor]            3532
IAN               2956   [Laboratory]          2509
SARAH             2952   [Spaceship]           2473
BRIGADIER         2809   [Tunnel]              2003

Looks great! What’s more iconic than the Doctor and his TARDIS?

However, we can assume that many errors lie lower down this frequency list. Let’s see how many characters and locations appear only once; they are quite likely to be recorded in error.

Wow! That doesn’t look too bad at all. Of course, we do have many locations that appear only briefly in the show, such as “Great Wall of China 1904”. In our data source, locations are always bounded by “[” and “]”, so they are easy to pick out and rarely recorded in error.

Numbered Characters

Some background characters, particularly aliens, are often listed as “DALEK1” or “ZYGON2”. Going forward, we would rather tabulate these characters together, aggregating each into their alien race or profession unless the character is specifically named.

For the Doctor, who may appear numbered if, e.g., DOCTOR10 turns up in a flashback during a DOCTOR12 episode, and for robots like “K9”, we’ll skip this step.

Before:

## 
##  DALEK2  DALEK1  DALEK3 MONOID1 MONOID2 
##     380     241     139      94      82

And after:

## 
##    DALEK   MONOID  DRAHVIN CYBERMAN     TECH 
##      800      244       99       91       66
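The renumbering step above can be sketched with stringr, assuming the dialogue lives in a `scripts` data frame with a `Character` column (the names here are hypothetical):

```r
library(dplyr)
library(stringr)

# Characters whose trailing digits are meaningful and must be kept
keep_numbered <- c("K9", paste0("DOCTOR", 1:13))

# Strip trailing digits from every other character name,
# so DALEK1, DALEK2, DALEK3 all collapse into DALEK.
scripts <- scripts %>%
  mutate(Character = if_else(Character %in% keep_numbered,
                             Character,
                             str_remove(Character, "[0-9]+$")))
```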

Double-ups

One thing we can note about the scripts is that we occasionally see characters speaking at once, e.g. “DOCTOR-AND-ROSE:”. We can either ignore these lines, or duplicate each script and location, creating a row for each character. Let’s give that second option a try. And while we’re here, since there are so few of these double lines, why don’t we run some quick analysis on them: who speaks together most often?
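The duplication step can be sketched with tidyr’s `separate_rows()`, again assuming a `scripts` data frame with a `Character` column:

```r
library(dplyr)
library(tidyr)

# "DOCTOR-AND-ROSE" becomes two rows, one for DOCTOR and one for ROSE,
# each carrying the same script line and location.
scripts <- scripts %>%
  separate_rows(Character, sep = "-AND-")
```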

In S30E10, the character SKY, once possessed, immediately repeats the words of those around her. It makes sense that she and the Doctor dominate these overlaps.

We ought to also correct for the fact that “DALEK” and “DALEKS” are roughly interchangeable for our purposes, so let’s combine these characters, along with all the other alien species and professions that appear as both singular and plural.
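A sketch of the singular/plural merge, using a hand-maintained lookup table (the entries shown are illustrative, not the full list):

```r
library(dplyr)

# Map plural forms onto the singular character name
plural_map <- c("DALEKS"   = "DALEK",
                "CYBERMEN" = "CYBERMAN",
                "ZYGONS"   = "ZYGON")

scripts <- scripts %>%
  mutate(Character = recode(Character, !!!plural_map))
```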

The Most Verbose Characters

We can easily tabulate the scripts to see which characters speak the most. Let’s start by looking at those characters which speak the most over the series’ 60-year run.

Because we have a total of 3005 characters, we’ll need to cut things down immediately for the plot to be interesting. Let’s try both with and without the Doctor, as we expect this character to dominate. We’ll take the top 50 in each case.

Normalization

Pretty interesting, but we know that characters appear in wildly different numbers of episodes: the Doctor appears in 334 episodes, “MAN” appears in 131, while Clara appears in only 39! Let’s see the most verbose speakers again after normalising for episode count. Let’s also calculate a few other basic stats; a lot of the characters with the most lines appear in only one episode, so we’ll limit some of our plots to repeat characters.

We can also calculate which characters have the most words per episode, and the most words per line (And the longest monologues).
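Both sets of per-character stats can be sketched in a single pass, assuming `scripts` has `Character`, `Episode` and `Script` columns:

```r
library(dplyr)
library(stringr)

char_stats <- scripts %>%
  mutate(Words = str_count(Script, "\\S+")) %>%   # count words per line
  group_by(Character) %>%
  summarise(Lines        = n(),
            Episodes     = n_distinct(Episode),
            LinesPerEp   = Lines / Episodes,
            WordsPerEp   = sum(Words) / Episodes,
            WordsPerLine = mean(Words),
            LongestLine  = max(Words)) %>%        # proxy for longest monologue
  filter(Episodes > 1)                            # repeat characters only
```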

POLO wins the most lines per episode. What a screen hog! Clocking in at ~330 lines in his only episode, Marco Polo has the most individual lines of any character. The Doctor doesn’t quite make that cut, but after limiting to repeat characters, we see that the Doctor, appearing in 99% of episodes, is not only the most prolific character but also the most verbose repeat character.

Words per line gives us some interesting results: SINGER and MUSIC, ANDREWMARR, TV, NARRATOR and NEWSMAN are all big winners here.

The Many Doctors

The Doctor themselves: all 13 (well, 12) of them. Over the course of the series, many actors and writers have taken a stab at the Doctor. How has this changed over time? Are older Doctors more verbose? Do newer Doctors get more lines? Let’s see if we can find any clear trends.

And let’s take a quick peek at which doctors were the most popular with viewers:

DOC10, David Tennant, takes the top spot, with none of the classic Doctors except DOC4, Tom Baker, coming close.

DOC13, Jodie Whittaker, is certainly not a fan favourite: the average DOC13 episode has a score that would be considered low for most previous Doctors.

2. Network Analysis

One thing we can easily generate from our data is a list of all characters and the number of times that they occur together within the same scene.

By counting the number of these interactions, we can construct a network using the R package igraph, which will attempt to draw a network connecting each character. For this analysis we’ll stick to mainstay repeat characters with a large number of lines, and we’ll perform the analysis split between the classic and modern eras, as we know that characters from the two do not interact (except in some very, very rare cases). Some separate characters do share names, particularly generic ones like “MAN”, but also some repeat characters such as “HARRY”. This is a largely unavoidable problem, and the only real solution would be extensive watching of each episode and manual editing of the script files.
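The edge-list construction can be sketched as a self-join on shared scenes, assuming `scripts` has `Episode`, `Scene` and `Character` columns (the scene identifier is a hypothetical name):

```r
library(dplyr)
library(igraph)

# One row per (episode, scene, character) appearance
appearances <- scripts %>% distinct(Episode, Scene, Character)

# Pair up every two characters sharing a scene, then count shared scenes
edges <- appearances %>%
  inner_join(appearances, by = c("Episode", "Scene")) %>%
  filter(Character.x < Character.y) %>%      # keep each pair once
  count(Character.x, Character.y, name = "weight")

g <- graph_from_data_frame(edges, directed = FALSE)
```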

Clustering

We can view such character-character interactions in different ways - We can use clustering, or we can attempt to view an overall network. Let’s see how both look.

This looks great! We can see that all of the Doctors and their mainstay companions stick together very closely.

Network

Let’s see if we can use the number of interactions as weights to drive a network graph of each series using igraph.

We’ll colour each node by their relevant role - Yellow for the doctor, blue for companions, red for villains and green for miscellaneous characters that don’t quite fit into any other category.
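The colouring can be sketched by mapping each character to a role; the `roles` lookup here is illustrative and incomplete:

```r
library(igraph)

roles <- c(DOCTOR = "doctor", CLARA = "companion", ROSE = "companion",
           MASTER = "villain", DAVROS = "villain")

palette <- c(doctor = "yellow", companion = "blue",
             villain = "red", misc = "green")

role <- roles[V(g)$name]
role[is.na(role)] <- "misc"        # everything unlisted is miscellaneous
V(g)$color <- palette[role]

# Edge widths driven by interaction counts
plot(g, edge.width = E(g)$weight / max(E(g)$weight) * 5)
```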

The network analysis yields some pretty interesting conclusions. Doctors and their companions stick together strongly, and the two primary villains are very central to the series, with both the Master and the Daleks connecting all of the Doctors with each other. We also see an increased presence of miscellaneous characters in the modern series, as well as signs that some companions connect individual Doctors (as they are carried over from one regeneration to another), such as Clara, Sarah and Rose.

Gephi network analysis

If we take all of the series’ episodes together as one, we are a bit overwhelmed with characters. Perhaps we can achieve a more detailed graph utilising the external software package Gephi. Gephi will allow us to quickly perform community analysis for our characters and re-colour our communities accordingly; it also provides a nice GUI for playing with many graphical settings.
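Gephi imports plain CSV edge lists, so the hand-off can be as simple as writing out our weighted character pairs with the column names Gephi expects (this assumes an `edges` data frame of character pairs and interaction counts, as built earlier):

```r
library(dplyr)

edges %>%
  rename(Source = Character.x, Target = Character.y, Weight = weight) %>%
  write.csv("docwho_edges.csv", row.names = FALSE)
```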

Here we will colour each community individually; the width of the connecting lines will be proportional to the total number of interactions between two characters, and the size of each character’s name will be proportional to their PageRank, a measure of node centrality.

Gephi Network (All Episodes)

This turned out to be a really effective way of visualising the over-arching character interactions of the series.

Firstly, we see the central role that repeat villains play throughout the series, with the DALEKs, the CYBERMEN, DAVROS and the MASTER being the only characters that connect all of the distant clusters, as well as being among the most central characters by PageRank.

Our community analysis delineates each Doctor and their companions very well, with both Doctor 13 and Doctor 7 clearly separated from all other characters. Interestingly, we see that Doctors 11 and 12, Doctors 2 and 6, and Doctors 9 and 10 each belong to the same community, as their companions carried over between “regenerations”.

A clear distinction between classic and modern eras is also visible, with the three modern era clusters separating to the left.

Episode Overlap Timeline

By tracking which characters appear together in each episode, we can construct an episode overlap timeline.
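One way to sketch such a timeline is a presence tile plot with ggplot2, assuming `scripts` has `Episode` (in broadcast order) and `Character` columns:

```r
library(dplyr)
library(ggplot2)

# One tile per character per episode they appear in
presence <- scripts %>% distinct(Episode, Character)

ggplot(presence, aes(x = Episode, y = Character)) +
  geom_tile() +
  theme_minimal() +
  theme(axis.text.x = element_blank())   # too many episodes to label
```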

Looks good! We can see how long each companion lasts and which other characters they frequently overlap with. It’s clear when previous Doctors pop back up for a quick reappearance, and where some episodes contain many more characters than others. Neat.

3. Textual Analysis

Character Wordclouds

One of the most common ways to get an overview of text-based data is to create word clouds, where the most frequent words in a character’s lines are represented graphically, with the size of each word corresponding to how frequently the character uses it.

Let’s try and see if we can create an R function that will generate a word cloud for a given character. We’ll be relying heavily on the package “tm” to achieve this.

wCloud <- function(character_name){
  
    # Requires: dplyr, tm, wordcloud, RColorBrewer.
    # allEps_for_docs is the cleaned script table; stopwords_en is an
    # extra, custom stop-word vector defined earlier.
    characterLines <- allEps_for_docs %>%
                        filter(Character == character_name) 
    
    characterWords <- VCorpus(VectorSource(characterLines$Script))
    
    # Clean the text: lower-case everything, then remove numbers,
    # punctuation, excess white space and "stop" words (such as can, and, the)
    characterWords_c <- characterWords %>%
      tm_map(content_transformer(tolower)) %>%
      tm_map(removeNumbers) %>%
      tm_map(removePunctuation) %>%
      tm_map(stripWhitespace) %>%
      tm_map(removeWords, c(stopwords_en, stopwords("english"),
                            "just", "thats", "dont", "got", "can", "now", "one"))
      
    # With clean text in hand, build a term-document matrix and sum each
    # term's counts across all of the character's lines
    character_term_mat <- as.matrix(TermDocumentMatrix(characterWords_c))
    words <- sort(rowSums(character_term_mat), decreasing = TRUE) 
    
    character_df <- data.frame(word = names(words), freq = words)
    
    # wordcloud() has no `main` argument, so add the title separately
    wordcloud(words = character_df$word, freq = character_df$freq, min.freq = 1,
              max.words = 200, random.order = FALSE, rot.per = 0.35,
              colors = brewer.pal(8, "Dark2"), scale = c(3, 0.25))
    title(main = character_name)
}

Let’s take a look at how some of those turned out. Let’s check DOC9 and his companion, ROSE:

Great! But other than the reference to “ROSE”, it’s unlikely we’d be able to tell these two word clouds apart from any other character’s.

TF-IDF

If we want a deeper analysis of character-specific language, we might want to process our input text a little further, using “Term Frequency-Inverse Document Frequency” (TF-IDF) to find words that one character says often but that are not otherwise frequently said throughout the rest of the text.

One extra step we will perform is to reduce words to their stems, for example: Dancing -> Dance and Houses -> House

For this slightly heavier duty word processing, we will rely on another R package, “quanteda”.

As we scroll through these columns, we see that “Doctor” and “Time” come up for most characters.

We do get some nice, character-specific hits regardless, such as “Exterminate”, “Obey”, “Prison” and “Locate doctor” for the Daleks, “Power” and “Kill” for The Master, but let’s pursue the inverse document frequency strategy.

Removing character names from this list is a tough decision. Really, names just tell us who a character spends time with, but removing them means we might inadvertently lose terms like “Dalek”, which could be interesting.

This is achieved easily via quanteda’s dfm_tfidf() function.
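A sketch of the quanteda pipeline, stemming included, treating each character as one “document” (column names are assumptions):

```r
library(dplyr)
library(quanteda)

# One document per character: all of their lines pasted together
docs <- scripts %>%
  group_by(Character) %>%
  summarise(text = paste(Script, collapse = " "))

toks <- corpus(docs, docid_field = "Character", text_field = "text") %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()                      # Dancing -> danc, Houses -> hous

char_tfidf <- dfm_tfidf(dfm(toks))

# Top terms most distinctive of the Daleks
topfeatures(char_tfidf["DALEK", ], 10)
```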

Interesting! We see “haroon” pop up from DOC3’s attempts at speaking an alien language, and “shush” from DOC1’s attempts to keep his companions quiet. DOC5 references the TARDIS most often. Looking down the list, we see more classic words: “spoiler” and “sweetie” for River. Rose, Martha and Donna all have the word “god” in their lists. Susan has “grandfather”, her common nickname for the Doctor, and so on. Looks good!

Sentiment Analysis

The most common form of